GSoC 2015 : Week 7

This post describes the work done in the seventh week of the GSoC coding period. For more details about the project, follow the Introduction link.

This week I worked on enabling the couchdb-client to deal with HTTP connections as streams. The changes are yet to be merged. Follow this link to get an idea of the current PR, which will be sent once I incorporate my mentor's feedback (available on the same page). This will be the second set of my changes to be merged into the couchdb-client project. To know about the first set of changes, see the Week 6 blog post.

Also, I have read about the Drush plugin and will be starting to work on it soon. The replicator still needs some functional tests and a way to deal with failures. This will be done in the current week.

GSoC 2015 : Week 6

This post describes the work done in the sixth week of the GSoC coding period. For more details about the project, follow the Introduction link.

This week the entire work revolved around making code changes as per the mentor feedback on the pull request to the couchdb-client. The PR has now been merged and is my first big contribution of this kind to the open source community, so it definitely feels good. 🙂 For more details, follow this link describing the PR.

This was the first set of changes to be merged into the couchdb-client. Another set of changes, which enables the client to deal with the HTTP connection as a stream, is yet to be submitted as a pull request. This will be done in the current week.

GSoC 2015 : Week 1

This post describes the work done in the first week of the GSoC coding period. For more details about the project, follow the Introduction link.

Community bonding and the initial days of week 1 mostly involved reading about general PHP development, getting used to the IDE, setting up Composer, learning to write and run PHPUnit tests, setting up and learning about CouchDB, etc. My mentor Dixon pointed me to phptherightway, which is a great resource for learning about general PHP development.

The couchdb-replicator needed a PHP-based HTTP client, and initially we were thinking of using Guzzle, as it has many features, including sending parallel HTTP requests. However, to speed up the work on the core issue, which is the implementation of replication.io, Dixon suggested that I use the pre-existing Doctrine couchdb-client. It does not provide methods for all the HTTP APIs supported by CouchDB, so I forked it and added the forked repository to my composer.json file (a sketch of this setup appears after the list below). Whenever I need to make an HTTP request that the client does not support, I update the client with a method for that particular request. After pushing the changes to GitHub, I run composer update for the couchdb-replicator to get the latest client as the dependency. As of now, implementing a Guzzle-based couchdb-client is a low-priority task.

This week I added code and tests for some of the initial phases of the replication protocol, along with the needed changes in the couchdb-client. These phases are:

  • peer verification
  • getting peers information
  • finding common ancestry
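For illustration, here is roughly what that composer.json setup looks like. This is a minimal sketch: the fork URL is a placeholder, and it assumes the client's package name is doctrine/couchdb.

```json
{
    "repositories": [
        {
            "type": "vcs",
            "url": "https://github.com/<your-username>/couchdb-client"
        }
    ],
    "require": {
        "doctrine/couchdb": "dev-master"
    }
}
```

With a `vcs` repository entry, Composer resolves the package from the fork instead of Packagist, so running composer update pulls in whatever was last pushed there.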

GSoC 2015 : Week 2

This post describes the work done in the second week of the GSoC coding period. For more details about the project, follow the Introduction link.

This week I completed the locate changed documents phase of the replication protocol and wrote tests for it. With this, we can now get the revisions that are present on the source but not on the target, and only these need to be transferred. I was away for most of this week, but now I am fully available to jump on the remaining tasks.
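For illustration, this phase is built on CouchDB's `_revs_diff` endpoint: the replicator POSTs the revisions it has for each document, and the target answers with the ones it is missing. A sketch of the exchange, with a made-up database name, document ID and revision:

```http
POST /target_db/_revs_diff HTTP/1.1
Content-Type: application/json

{"doc_1": ["2-7051cbe5c8faecd085a3fa619e6e6337"]}
```

If the target does not have that revision, the response marks it as missing, so the replicator knows it has to transfer it:

```http
HTTP/1.1 200 OK

{"doc_1": {"missing": ["2-7051cbe5c8faecd085a3fa619e6e6337"]}}
```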

After this, the major replicate changes phase will start, which will continue for most of this month. Also, since the CouchDB Replication Protocol works on top of HTTP, which is based on TCP/IP, the replicator should expect to be working within an unstable environment with delays, losses and other bad surprises that might eventually occur. The replicator should not count every HTTP request failure as a fatal error. It should be smart enough to detect timeouts, repeat failed requests, be ready to process incomplete or malformed data, and so on. These things are the main concerns and need to be discussed with my mentors and CouchDB community members before I move ahead.
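As a starting point for that discussion, the simplest shape of such resilience is a retry wrapper around individual requests. A minimal, hypothetical sketch (the callable interface and attempt limit are placeholders, not the replicator's actual API):

```php
<?php
// Hypothetical sketch: retry a failing HTTP call with exponential
// backoff instead of treating the first failure as fatal.
function withRetries(callable $request, $maxAttempts = 5)
{
    for ($attempt = 1; ; $attempt++) {
        try {
            return $request(); // any callable that throws on failure
        } catch (\Exception $e) {
            if ($attempt >= $maxAttempts) {
                throw $e; // give up only after the last attempt
            }
            sleep(pow(2, $attempt - 1)); // back off: 1s, 2s, 4s, ...
        }
    }
}
```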

GSoC 2015 : Week 3

This post describes the work done in the third week of the GSoC coding period. For more details about the project, follow the Introduction link.

This week I worked on implementing the replicate changes phase. Here the main challenge was dealing with the multipart/mixed responses from CouchDB. Unlike any of the previous responses I have dealt with, this is not JSON data. So I wrote a basic parser myself, which seems to be working. I may need to improve upon it after I take a look at how one of the members of the PouchDB community did it in JavaScript. After I completed the basic steps of the phase, I ran the completed replicator for the first time and WOAH..! I was able to replicate multiple documents with image and text attachments. It felt good.. 😛
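To make the idea concrete, here is a minimal sketch of such a boundary-based parser. This is only an illustration of the approach, not the project's actual parser: it assumes the whole body fits in memory, CRLF line endings, and well-formed parts.

```php
<?php
// Split a multipart/mixed body into parts using the boundary from the
// Content-Type header. Each part is returned as headers + raw content.
function parseMultipartMixed($body, $boundary)
{
    $parts = [];
    foreach (explode("--$boundary", $body) as $chunk) {
        $chunk = trim($chunk, "\r\n");
        // Skip the preamble and the trailing "--" of the closing boundary.
        if ($chunk === '' || $chunk === '--') {
            continue;
        }
        // A part is: header lines, a blank line, then the content.
        list($rawHeaders, $content) = explode("\r\n\r\n", $chunk, 2);
        $headers = [];
        foreach (explode("\r\n", $rawHeaders) as $line) {
            list($name, $value) = explode(':', $line, 2);
            $headers[strtolower(trim($name))] = trim($value);
        }
        $parts[] = ['headers' => $headers, 'content' => $content];
    }
    return $parts;
}
```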

Now I will be improving the way the multipart data is handled. For a smaller memory footprint, I need to handle the data as a stream where I can process it line by line, as we don't want to hang our system by storing, say, a whole 10 GB attachment in memory. Yield was introduced in PHP 5.5, and I think it will ease the task of handling data line by line. This, along with writing tests and documentation for all the changes made to the couchdb-client project, will be the major work for the coming week.
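A minimal sketch of what that could look like with a generator (illustrative only; $socket is assumed to be an already-open connection, e.g. from fsockopen):

```php
<?php
// Lazily yield one line of the response at a time instead of
// buffering the whole body in memory.
function readLines($socket)
{
    while (!feof($socket)) {
        $line = fgets($socket);
        if ($line === false) {
            break;
        }
        yield $line;
    }
}

// Usage: only the current line is held in memory at any moment.
// foreach (readLines($socket) as $line) { /* process $line */ }
```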

GSoC 2015 : Week 5

This post describes the work done in the fifth week of the GSoC coding period. For more details about the project, follow the Introduction link.

It's time for the midterm evaluation and I am on track as per my timeline..! So I need not worry about it. 😛

This week I added some new features to the replicator and modified the code written for the couchdb-client, mostly involving the removal of unnecessary commits and the addition of tests, in order to make it suitable for a pull request. The PR is yet to be merged and can be seen by following link [1].

Last week I experimented with streaming docs that have attachments. This week I added the logic to the couchdb-client as a new MultipartClient, which reads data from the source in chunks, processes it and transfers it to the target with the desired modifications. It is not very general right now and supports only streaming the multipart response from the source to the target. It will be modified as per the feedback of the maintainer of the couchdb-client, once I get the first set of changes merged. The MultipartClient and the related set of changes to use it can be seen by following link [2].

The replicator now supports continuous replication, which means that once a replication has been started, the replicator-source and replicator-target connections will not end after the set of changes has been transferred. They will remain connected, and as soon as any change involving insertion, deletion or modification happens at the source, it will be transferred to the target. It has two variants: the first, where the replication never stops and a periodic heartbeat is sent continuously; and the second, where a max timeout can be set to wait before closing the connection and sending the response. Currently only the source-replicator connection remains open. The changes to support this can be seen by following the links in [3].
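For reference, these two variants map onto parameters of CouchDB's continuous `_changes` feed: `heartbeat` keeps an otherwise idle connection alive by periodically sending newlines, while `timeout` makes the server close the connection after a period of inactivity. A sketch of such a request (host and database are made up):

```http
GET /source_db/_changes?feed=continuous&heartbeat=10000 HTTP/1.1
Host: couch-source.example.com
```

Each change then arrives on the open connection as a single JSON line, e.g. `{"seq":12,"id":"doc_1","changes":[{"rev":"2-7051..."}]}`, which is what makes the line-by-line stream handling from the earlier weeks a natural fit here.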

Now, in the current week, I don't plan to add any new features or to start the Drupal-related part by writing the Drush plugin. I will mostly be writing tests, which my mentor greatly emphasizes 😛, for the client and replicator. Making changes to the client for the PR is another thing that I will work on.

Links:

  1. couchdb-client/pull/42
  2. couchdb-client/tree/trying_generators/lib/Doctrine/CouchDB
  3. continuous replication:
    1. replicator: couchdb-replicator/tree/continuous_replication
    2. client: couchdb-client/tree/continuous_replication/lib/Doctrine/CouchDB

GSoC 2015 : Week 4

This post describes the work done in the fourth week of the GSoC coding period. For more details about the project, follow the Introduction link.

This week I mostly spent time figuring out how to stream data in PHP. I had heard about Guzzle, the PHP HTTP client, and its good support for streams. But we wanted to do this without any external dependency, and given our initial decision to use the couchdb-client, I chose not to use Guzzle for now. One other way was using curl's CURLOPT_READFUNCTION option, which allows one to set a callback function returning chunks of data from a stream. However, I decided not to add new components to the current couchdb-client. So I used the file pointer returned by fsockopen directly to read and write data in chunks. I requested the docs and attachments from the source and wrote them to the target, reading the stream from the source line by line and writing it to the file pointer for the target's connection. With this I was able to replicate a ~150MB attachment with a PHP memory limit of just 1MB.
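An illustrative sketch of that relay idea follows. The hosts, database and document names are made up, and error handling and header parsing are omitted; the key point is that the body is relayed line by line, so no attachment is ever held in memory as a whole.

```php
<?php
$source = fsockopen('couch-source.example.com', 5984);
$target = fsockopen('couch-target.example.com', 5984);

// Ask the source for the document with its attachments. HTTP/1.0
// keeps the response un-chunked, which makes line-wise relaying easy.
fwrite($source, "GET /db/doc?attachments=true HTTP/1.0\r\n"
    . "Host: couch-source.example.com\r\n"
    . "Accept: multipart/related\r\n\r\n");

// Skip the source's response headers (everything up to a blank line).
while (($line = fgets($source)) !== false && trim($line) !== '') {
}

// Start the upload to the target. CouchDB insists on Content-Length
// (see the next paragraph), so a real implementation computes it
// before this point; it is hard-coded here for illustration.
fwrite($target, "PUT /db/doc HTTP/1.1\r\n"
    . "Host: couch-target.example.com\r\n"
    . "Content-Type: multipart/related; boundary=abc123\r\n"
    . "Content-Length: 1048576\r\n\r\n");

// Relay the body: only one line lives in memory at a time.
while (($line = fgets($source, 8192)) !== false) {
    fwrite($target, $line);
}
fclose($source);
fclose($target);
```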

Another issue that I faced while doing this: initially I was trying to use "Transfer-Encoding: chunked" as one of the header options while connecting to the target. This is used when you don't know the entire length of the content you are sending, say in situations where you are receiving data and want to send it on to the target without storing it entirely on your local machine, maybe because of memory constraints or other reasons. This is what I was doing. But after a lot of trying, the doc and attachments were not getting replicated. So I talked to Kxepal, a CouchDB member, and came to know that this is an issue with CouchDB: it needs the Content-Length header, and a fix was proposed but has not been merged yet. Wish I had talked to him before starting this.. So I read the stream till the attachment start, hoping that the doc will be small enough to fit in memory, then calculated the content length based on the doc length, the attachment length and the other standard \r's and \n's. With this I was able to do stream handling of the response from the source and upload it line by line to the target with a smaller memory footprint, a much needed feature for the replicator..!
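As a rough illustration of that calculation: the Content-Length is simply the byte count of everything written after the blank line that ends the HTTP headers. A hypothetical sketch, assuming a simplified framing of one JSON doc part followed by one attachment part (the exact header lines in a real body depend on what the source sent, so a real implementation must count the actual bytes it relays):

```php
<?php
// Hypothetical sketch: sum the bytes of every piece of the multipart
// body up front so the Content-Length header can be sent first.
function multipartContentLength($docJson, $attachmentHeaders, $attachmentLength, $boundary)
{
    return strlen("--$boundary\r\n")                        // opening boundary line
        + strlen("Content-Type: application/json\r\n\r\n")  // doc part headers + blank line
        + strlen($docJson) + strlen("\r\n")                 // doc body + its trailing CRLF
        + strlen("--$boundary\r\n")                         // boundary before the attachment
        + strlen($attachmentHeaders) + strlen("\r\n")       // attachment headers + blank line
        + $attachmentLength + strlen("\r\n")                // attachment bytes + trailing CRLF
        + strlen("--$boundary--");                          // closing boundary
}
```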

Now I need to see if it can be merged with the couchdb-client.

GSoC 2015: Content Staging Solution for Drupal 8

Introduction

Hi folks..! I am Abhishek Kumar, a CSE 2011 student at IIIT Hyderabad. This post will give updates about the work going on in my Google Summer of Code 2015 project, which is being done under Drupal. Dick Olsson (dixon_) and Christian López Espínola (penyaskito) are my mentors for the project. The project is about content staging, which is a highly desired feature in many circumstances: for example, where content is developed inside a firewall-protected network and then pushed to the production site, or where content is edited on separate development sites and staged to a central site. There are various issues involved in content staging, and it is not straightforward. Content may have dependencies like authors and tags, which also need to be transferred when the content is replicated. There are also setups where a content workflow might cause conflicts between revisions of content, which need to be handled by the system. The various modules for this system are under active development for Drupal 8. For example:

  • Multiversion module, which extends Drupal's content storage and revisioning model, as well as handling update sequences for easier dependency management (unlike the existing Entity Dependency module for Drupal 7, which recursively builds a complicated graph).
  • Relaxed Web Services module, which exposes Drupal content over a REST API that follows the same specification as CouchDB.
  • Deploy module, which provides a UI.

These modules are not yet complete and are constantly evolving. Another important component of the system is the replicator, which is responsible for transferring changes from the source site to the target site. Handling both the module development and the implementation of the replicator does not seem possible to me within the three-month GSoC period, as the modules involved (e.g. Multiversion) need to deal with complex issues. Working on the modules with the rest of the community will be a filler task, as of now. So, my GSoC project is to develop the replicator, which:

  1. Will be a stand-alone PHP library.
  2. Will be based on the CouchDB replication protocol.
    1. Concise description: link
    2. Detailed description: link
  3. Will not depend on Drupal 8. The source and target of the replication can be any system that implements the CouchDB API specification.
  4. Initially, will be built and tested with CouchDB endpoints, since the Drupal modules are still in development. In the second half of the GSoC period the replicator will be tested with the Drupal modules to complete the full content staging solution.
  5. Will be integrated into Drupal with a Drush plugin.
  6. Will get a simple Drupal module with a UI for managing replications.
  7. Will have full unit test coverage.

To know more about the work in this direction, see this video of a presentation given by Dixon at DrupalCon.


Links to weekly blogs

Each week I write a blog post describing the work done in that week. Use the links below to access the posts for the work done so far.

  1. Week 1
  2. Week 2
  3. Week 3
  4. Week 4
  5. Week 5
  6. Week 6
  7. Week 7