[jdev] PubSub & News Feeds


Kelly S
05-24-2008, 06:54 PM
Hiya.

I'm getting back into the jabber swing of things after a few years of being away and I'm digging into PubSub!

I am planning to index a bunch of RSS/Atom etc via PubSub. I have read over some of the XEP and I am a bit confused about a few things so I would really appreciate some clarification :)

From what I can read on PubSub, it's clear how users subscribe to feeds and receive events, but how can I handle a user requesting a feed that is not indexed on the server yet, so that the server can start indexing that feed and begin publishing the new entries it pulls? (I will be writing the service which pulls feeds off the web.)

Also, a bunch of concerns pop into my head that I'm still unclear about when maintaining all this "feed" data.

1. Is there any way I can just publish the latest feeds I pulled down and have PubSub discard any duplicates that may already exist? Or is the right way to handle this to query every single entry individually to check whether it exists before publishing it?

1.a If I have to query each one individually, would I make the entry id the URL of the entry, so that I can check for duplicates as I pull them down off the web and read the entries? (Example: 4 old entries, but 1 new entry since the last pull.)

2. Entry requesting. Is there any sort of querying we can use against entries? If we are publishing tons of entries, someone may want to browse/read them but only request X at a time, or only those newer than a certain date, etc. I don't think pulling the entire entry history of a couple of months is going to be very efficient.

Some of my concerns for #2 arise because we are going to mobilize this, and data plans are expensive for mobile devices. In this country it can be $25/1.5MB/month. We want our mobile client to query data efficiently, but do these capabilities exist in PubSub?
I'm sure I am just missing some concepts on how parts of PubSub work.
Thanks so much for your time :)
_______________________________________________
JDev mailing list
FAQ: http://www.jabber.org/discussion-lists/jdev-faq
Forum: http://www.jabberforum.org/forumdisplay.php?f=20
Info: http://mail.jabber.org/mailman/listinfo/jdev
Unsubscribe: JDev-unsubscribe (AT) jabber (DOT) org
_______________________________________________

Jehan
05-25-2008, 09:25 PM
Kelly S wrote:
> Hiya.
> From what I can read on PubSub, it's clear how users subscribe to feeds and receive events, but how can I handle a user requesting a feed that is not indexed on the server yet, so that the server can start indexing that feed and begin publishing the new entries it pulls? (I will be writing the service which pulls feeds off the web.)


As far as I understand, and remembering your previous post, you want to be able to transform any RSS publication into an XMPP node publication. So the point here is that your pubsub tree is "dynamic". From what I know, this is not in the basic XEP, because it implies a relation between the web and XMPP which has not been planned for (you wanted to tie the HTTP URL to the pubsub node, so that for instance an RSS feed at http://subdomain.domain/some/path/feed/ would be in the node /subdomain.domain/some/path/feed/, didn't you?).
Yet such a service is probably very easy to implement on a server: on receiving a subscription request, it would check whether the address you entered really links to an RSS feed, then create the associated node and begin pulling.
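
The on-demand flow described above might be sketched as follows in Python. This is only an illustration under stated assumptions: `looks_like_feed` and `create_node_iq` are hypothetical helper names, fetching the URL and sending the stanza are left out, and the only protocol detail relied on is the `<create/>` request defined in XEP-0060.

```python
import xml.etree.ElementTree as ET

PUBSUB_NS = "http://jabber.org/protocol/pubsub"
ATOM_FEED_TAG = "{http://www.w3.org/2005/Atom}feed"


def looks_like_feed(document):
    """Return True if the fetched document parses as an RSS 2.0 or Atom feed.

    Hypothetical check: only the root element is inspected.
    """
    try:
        root = ET.fromstring(document)
    except ET.ParseError:
        return False
    return root.tag in ("rss", ATOM_FEED_TAG)


def create_node_iq(service, node, iq_id="create1"):
    """Build the XEP-0060 node-creation request for a newly requested feed."""
    iq = ET.Element("iq", {"type": "set", "to": service, "id": iq_id})
    pubsub = ET.SubElement(iq, "{%s}pubsub" % PUBSUB_NS)
    ET.SubElement(pubsub, "{%s}create" % PUBSUB_NS, {"node": node})
    return iq


# Example: the body fetched from the URL the user asked for (assumed input).
fetched = "<rss version='2.0'><channel><title>Example</title></channel></rss>"
if looks_like_feed(fetched):
    stanza = create_node_iq("pubsub.shakespeare.lit",
                            "/subdomain.domain/some/path/feed/")
```

The service would send `stanza` once and then start its polling loop for that node; a request for a URL that does not parse as a feed would simply be refused.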

> Also, a bunch of concerns pop into my head that I'm still unclear about when maintaining all this "feed" data. 1. Is there any way I can just publish the latest feeds I pulled down and have PubSub discard any duplicates that may already exist? Or is the right way to handle this to query every single entry individually to check whether it exists before publishing it?


Here you would simply do what any normal aggregator already does: you check the time. You know the last time you got any news from an RSS feed, so you stop reading the feed once you reach that date.
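
A minimal sketch of that time check, in Python. The entry dictionaries, the `published` key, and the `new_entries` helper are assumptions made for illustration; a real service would fill them from whatever feed parser it uses.

```python
from datetime import datetime, timezone


def new_entries(entries, last_seen):
    """Return the entries published strictly after last_seen, oldest first."""
    fresh = [entry for entry in entries if entry["published"] > last_seen]
    return sorted(fresh, key=lambda entry: entry["published"])


# Hypothetical pull: four entries already seen and one new entry.
last_seen = datetime(2008, 5, 25, 12, 0, tzinfo=timezone.utc)
entries = [
    {"id": "http://example.org/post/%d" % n,
     "published": datetime(2008, 5, 22 + n, 9, 0, tzinfo=timezone.utc)}
    for n in range(5)
]
to_publish = new_entries(entries, last_seen)
```

Only the entries in `to_publish` would then be published to the node. This relies on the feed carrying trustworthy timestamps, which is an assumption.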


> Thanks so much for your time


You're welcome. ;-)

Kelly S
05-26-2008, 12:34 AM
Thanks for your time Jehan!

Sometimes I am unsure how to explain what I have in my head lol so let me try and clarify some more. Responses are found below.



On Sun, 25 May 2008 21:25:27 +0200, Jehan wrote:
> Kelly S;548 Wrote:
>> Hiya.
>> From what I can read on PubSub, it's clear how users subscribe to feeds
>> and receive events, but how can I handle a user requesting a feed that is
>> not indexed on the server yet, so that the server can start indexing
>> that feed and begin publishing the new entries it pulls? (I will be
>> writing the service which pulls feeds off the web.)
>
> As far as I understand, and remembering your previous post, you
> want to be able to transform any RSS publication into an XMPP node
> publication. So the point here is that your pubsub tree is "dynamic".
> From what I know, this is not in the basic XEP, because it implies a
> relation between the web and XMPP which has not been planned for (you
> wanted to tie the HTTP URL to the pubsub node, so that for instance an
> RSS feed at http://subdomain.domain/some/path/feed/ would be in the node
> /subdomain.domain/some/path/feed/, didn't you?).
> Yet such a service is probably very easy to implement on a server:
> on receiving a subscription request, it would check whether the address
> you entered really links to an RSS feed, then create the associated
> node and begin pulling.

I don't really need the node tree to reflect the URL subdirectory structure, if that's what you're mentioning with the "/subdomain.domain/some/path/feed/" path.

I've created a root node with the id of "feeds" and inside here is really where I just want to populate a whole ton of feeds. However I need to relate their web url somehow when trying to fetch these feeds because the "service" which is going to download & publish is going to need to know where to publish to.

I have been looking at the APIs of some PubSub managers, and many take the node id as a string. I wonder whether, if I put a URL with "/" in there, it's going to treat that as a sub-node tree structure, when all I really need is the following (I think?):

/feeds/someurl/entries*

I hope this makes sense. I'm not sure how else to describe it.
>> Also, a bunch of concerns pop into my head that I'm still unclear about
>> when maintaining all this "feed" data. 1. Is there any way I can just
>> publish the latest feeds I pulled down and have PubSub discard any
>> duplicates that may already exist? Or is the right way to handle this to
>> query every single entry individually to check whether it exists before
>> publishing it?
>
> Here you would simply do what any normal aggregator already does: you
> check the time. You know the last time you got any news from an RSS feed,
> so you stop reading the feed once you reach that date.

So basically I can request PubSub to send me the latest 1 item, take its date, then pull the RSS feed off the web and only publish the items that are newer? This makes sense, I think. I was hoping I didn't have to execute queries before pushing data, as that would put more load on the XMPP service, but if I am able to request just the 1 latest item, at least that's minimizing the hit.
>> Thanks so much for your time
>
> You're welcome. ;-)

Again thanks so much. I am quite confused on how to relate the node structure to feed urls so hopefully I can get this figured out.

Jehan
05-26-2008, 02:36 PM
Kelly S wrote:
> I don't really need the node tree to reflect the URL subdirectory structure, if that's what you're mentioning with the "/subdomain.domain/some/path/feed/" path.
>
> I've created a root node with the id of "feeds" and inside here is really where I just want to populate a whole ton of feeds. However I need to relate their web url somehow when trying to fetch these feeds because the "service" which is going to download & publish is going to need to know where to publish to.
>
> I have been looking at the APIs of some PubSub managers, and many take the node id as a string. I wonder whether, if I put a URL with "/" in there, it's going to treat that as a sub-node tree structure, when all I really need is the following (I think?):
>
> /feeds/someurl/entries*
>
> I hope this makes sense. I'm not sure how else to describe it.


Yes, this makes sense. I had a quick look at the XEP. I don't see anything which says which characters are allowed in an item id. Maybe it is there somewhere; there are so many XEPs linking to one another. Someone on this list who knows the XEPs better than me could probably answer. But maybe an id could simply be anything which is accepted as an XML attribute value. In this case, you can probably use '/'. But I still need confirmation from other people, so don't trust my writing!

And since the node name and the item id are given separately anyway, I don't think distinguishing them would cause any major issue. Something like this to publish an item could maybe do the trick:


<iq type='set'
    from='hamlet@denmark.lit/blogbot'
    to='pubsub.shakespeare.lit'
    id='publish1'>
  <pubsub xmlns='http://jabber.org/protocol/pubsub'>
    <publish node='/feeds/someurl/entries'>
      <item id='http://subdomain.domain/some/path/rssfeed/'>
        <entry xmlns='http://www.w3.org/2005/Atom'>
          ...
        </entry>
      </item>
    </publish>
  </pubsub>
</iq>


Yet the problem with this method is that you have only one item per RSS feed. Is this what you want? I was rather thinking of one leaf node per RSS feed, and then one item in this node for every RSS item.
In this case, you could transform the URL http://subdomain.domain/some/path/rssfeed into the pubsub node: /feeds/someurl/entries/subdomain.domain/some/path/rssfeed/
Then inside this leaf node, you can publish all the items you want.
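
The "one leaf node per feed, one item per entry" layout could be sketched like this in Python. The helper names `feed_node` and `publish_iq`, the "/feeds" root, and the choice of the entry URL as the item id are all assumptions made for illustration, not something the XEP prescribes.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

PUBSUB_NS = "http://jabber.org/protocol/pubsub"
ATOM_NS = "http://www.w3.org/2005/Atom"


def feed_node(feed_url, root="/feeds"):
    """Derive a leaf node name from a feed URL (host plus path under root)."""
    parts = urlsplit(feed_url)
    return "%s/%s%s" % (root, parts.netloc, parts.path)


def publish_iq(service, node, entry_id, title, iq_id="publish1"):
    """Build a publish request carrying one Atom entry as one pubsub item."""
    iq = ET.Element("iq", {"type": "set", "to": service, "id": iq_id})
    pubsub = ET.SubElement(iq, "{%s}pubsub" % PUBSUB_NS)
    publish = ET.SubElement(pubsub, "{%s}publish" % PUBSUB_NS, {"node": node})
    item = ET.SubElement(publish, "{%s}item" % PUBSUB_NS, {"id": entry_id})
    entry = ET.SubElement(item, "{%s}entry" % ATOM_NS)
    ET.SubElement(entry, "{%s}title" % ATOM_NS).text = title
    return iq


node = feed_node("http://subdomain.domain/some/path/rssfeed")
stanza = publish_iq("pubsub.shakespeare.lit", node,
                    "http://subdomain.domain/some/path/post-1",
                    "First post")
```

Each entry pulled from the feed gets its own `publish_iq` call against the same leaf node, so subscribers to that node receive one notification per entry.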


> So basically I can request PubSub to send me the latest 1 item, take its date, then pull the RSS feed off the web and only publish the items that are newer? This makes sense, I think. I was hoping I didn't have to execute queries before pushing data, as that would put more load on the XMPP service, but if I am able to request just the 1 latest item, at least that's minimizing the hit.


Yes, that's it. I think this is the way aggregators work. And what you want to do is, in the end, just some kind of dynamic aggregator transforming RSS feeds into Jabber feeds.
Anyway, I don't think that you can really avoid queries, because this is the basic problem of pull systems: you always have to query the data simply to know whether or not you need to do something! This is inherently inefficient!
Your system will bring the advantages of Jabber feeds, but I don't think it can really remove the flaws of RSS (inefficiency and lack of real-time delivery), as it still relies on it. To do so, you would need pure Jabber notification from beginning to end.
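
Requesting only the most recent item, as discussed above, might look like the following Python sketch. It builds the item-retrieval request with the `max_items` attribute described in XEP-0060; whether a given service honours it is an assumption to verify against that service, and `latest_items_iq` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

PUBSUB_NS = "http://jabber.org/protocol/pubsub"


def latest_items_iq(service, node, count=1, iq_id="items1"):
    """Build a request for the `count` most recent items of a node."""
    iq = ET.Element("iq", {"type": "get", "to": service, "id": iq_id})
    pubsub = ET.SubElement(iq, "{%s}pubsub" % PUBSUB_NS)
    ET.SubElement(pubsub, "{%s}items" % PUBSUB_NS,
                  {"node": node, "max_items": str(count)})
    return iq


stanza = latest_items_iq("pubsub.shakespeare.lit",
                         "/feeds/subdomain.domain/some/path/rssfeed")
```

The same request with a larger `count` would also let a mobile client fetch a bounded number of entries instead of a node's whole history.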

Anyway, as long as pure Jabber feeds are not widespread (while RSS is), your system is probably better than nothing. :-)

Peter Saint-Andre
05-27-2008, 11:46 PM
On 05/26/2008 6:36 AM, JabberForum wrote:
> Kelly S;562 Wrote:
>> I don't really need the node tree to reflect the url sub directory
>> structure if that's what you're mentioning in the
>> "/subdomain.domain/some/path/feed/" path.
>>
>> I've created a root node with the id of "feeds" and inside here is
>> really where I just want to populate a whole ton of feeds. However I
>> need to relate their web url somehow when trying to fetch these feeds
>> because the "service" which is going to download & publish is going to
>> need to know where to publish to.
>>
>> I have been looking at some APIs of PubSub managers etc and many take
>> node id as a string and I wonder if I put a url in there with "/" it's
>> going to think that is a sub node tree structure when all I really need
>> is the following (I think?):
>>
>> /feeds/someurl/entries*
>>
>> I hope this makes sense. I'm not sure how else to describe it.
>>
>
> Yes this makes sense. I had a fast look at the XEP. I don't see
> anything which tell which characters are allowed in an item id.

"If a pubsub node is addressable as a JID plus node, the pubsub service
SHOULD ensure that the NodeID conforms to the Resourceprep profile of
Stringprep as described in RFC 3920."

So basically you can include just about any character in a NodeID.

BTW, NodeIDs do not have semantic meaning (they are opaque as far as the
spec is concerned), but if a given deployment wants to attribute meaning
to "/" as a hierarchical separator (or whatever) then that's their business.

And remember that not everything needs to be defined at the level of
XEP-0060. You can define your own semantics on top of that in your own
application, without requiring any modifications to XEP-0060 itself. Or
so we hope. ;-)
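
One way an application might layer its own semantics on top of opaque NodeIDs is sketched below in Python. Everything here is hypothetical: the `FeedRegistry` class, the "feeds" prefix, and the use of a SHA-1 digest as the NodeID are illustrative choices, with the URL-to-node mapping kept entirely in the application.

```python
import hashlib


class FeedRegistry:
    """Application-level mapping between feed URLs and opaque NodeIDs."""

    def __init__(self, root="feeds"):
        self.root = root
        self._url_by_node = {}

    def node_for(self, feed_url):
        """Return a stable NodeID for the feed URL and remember the mapping."""
        digest = hashlib.sha1(feed_url.encode("utf-8")).hexdigest()
        node = "%s/%s" % (self.root, digest)
        self._url_by_node[node] = feed_url
        return node

    def url_for(self, node):
        """Return the feed URL registered for a NodeID, or None if unknown."""
        return self._url_by_node.get(node)


registry = FeedRegistry()
node = registry.node_for("http://subdomain.domain/some/path/feed/")
```

With this approach the pubsub service never has to interpret the URL, and the fetching service looks up where to publish through the registry.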

Peter

--
Peter Saint-Andre
https://stpeter.im/



Jehan
05-28-2008, 12:48 AM
Peter Saint-Andre wrote:
> And remember that not everything needs to be defined at the level of
> XEP-0060. You can define your own semantics on top of that in your own
> application, without requiring any modifications to XEP-0060 itself. Or
> so we hope. ;-)
>
> Peter


Hi,

Even though I have some idea about this, could you give an example (even a conceptually bad one, just to have something concrete), for instance in the context of what Kelly wants to do, please?
Thanks.

Jehan

Peter Saint-Andre
06-03-2008, 11:48 PM
On 05/27/2008 4:48 PM, JabberForum wrote:
> Peter Saint-Andre;605 Wrote:
>> And remember that not everything needs to be defined at the level of
>> XEP-0060. You can define your own semantics on top of that in your own
>> application, without requiring any modifications to XEP-0060 itself.
>> Or
>> so we hope. ;-)
>>
>> Peter
>>
>
> Hi,
>
> even though I have some idea about this, could you give an example
> (even a bad one conceptually, just to have a concrete example) about
> this, for instance in the context of what Kelly wants to do please?
> Thanks.

The old pubsub.com service is a good example. They built a nice web
interface for choosing your feeds. Most of those used keyword matching
(e.g., send me all entries that match either "Jabber" or "XMPP") but you
didn't have to worry about the exact NodeID syntax over XMPP, they did
that in their application layer above XEP-0060.

Peter

--
Peter Saint-Andre
https://stpeter.im/

